==============================
Documentation on NomesLex-PT01
==============================

Contents
========
1. Introduction
2. NomesLex-PT01 characterization
3. Person name extraction and classification
4. Validation
5. Licensing terms
6. Acknowledgments


1. Introduction
================
NomesLex-PT01 is a lexicon of person names made up of 2,027 first names and 8,019 surnames, together with their corresponding frequencies. These (mostly Portuguese) names were selected from the public list of teachers’ 2009 recruitment, published on the Portuguese Ministry of Education website, https://servicos.dgrhe.min-edu.pt/ConsultaCandidatos/.

NomesLex-PT01 is especially useful for named entity recognition and resolution tasks involving Portuguese.

2. NomesLex-PT01 characterization 
=================================
NomesLex-PT01 is available in two separate files, both encoded in UTF-8 and formatted as TSV (tab-separated values):

- NomesLex-PT01-firstnames.utf8.tsv (made up of 2,027 first names).
- NomesLex-PT01-surnames.utf8.tsv (made up of 8,019 surnames).

Each line contains a (fully capitalized) name followed by its frequency in the public list of teachers’ 2009 recruitment.

For example, the five most frequent first names (with their frequencies) are:
	MARIA (14374)
	ANA (7966)
	CARLA (2660)
	SANDRA (2300)
	PAULA (2021)

Five most-frequent surnames:
	SILVA (10170)
	SANTOS (6785)
	FERREIRA (5918)
	PEREIRA (5557)
	OLIVEIRA (4614)
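
The file format described above can be read with a few lines of Python. This is a minimal sketch, assuming each line holds exactly two tab-separated columns (name, then frequency) and that the files have no header row; the function names ``parse_lexicon``, ``load_lexicon`` and ``top_names`` are hypothetical helpers, not part of the distribution.

```python
def parse_lexicon(lines):
    """Parse 'NAME<TAB>FREQUENCY' lines into a {name: frequency} dict.

    Assumption: two tab-separated columns per line and no header row.
    """
    lexicon = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        name, freq = line.split("\t")[:2]
        lexicon[name.strip()] = int(freq)
    return lexicon


def load_lexicon(path):
    """Read one of the lexicon files from disk (UTF-8 encoded)."""
    with open(path, encoding="utf-8") as handle:
        return parse_lexicon(handle)


def top_names(lexicon, n=5):
    """Return the n most frequent (name, frequency) pairs."""
    return sorted(lexicon.items(), key=lambda item: item[1], reverse=True)[:n]


# In-memory sample using frequencies quoted in this document.
sample = ["MARIA\t14374\n", "ANA\t7966\n", "CARLA\t2660\n"]
first_names = parse_lexicon(sample)
print(top_names(first_names, 2))
```

With the real data, ``load_lexicon("NomesLex-PT01-firstnames.utf8.tsv")`` would return the full first-name table, provided the file is in the working directory and matches the assumed layout.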


3. Person name extraction and classification
============================================
To extract the first names and surnames, we first tokenized the collected full names (92,544 entries), using the space as separator, and then removed the connecting words that are common in Portuguese names: "de", "da", "do", "das", "dos", "d'" and "e". Abbreviations (e.g. "S." and "S") and connectors specific to other languages (e.g. "del" and "y", in Spanish) were also removed from the name combinations. The first remaining token was taken as a first name. Because it is very common in Portuguese culture to have two given names, we discarded the second token and took the third and subsequent tokens as family names, i.e. as surnames.
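
The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the original extraction code: the connector set contains only the words listed in the text, the abbreviation rule (a single letter with an optional trailing period) is a guess at what "S." and "S" stand for, and the names ``split_full_name`` and ``count_names`` are hypothetical.

```python
from collections import Counter

# Connectors named in the text (Portuguese, plus the two Spanish examples).
CONNECTORS = {"de", "da", "do", "das", "dos", "d'", "e", "del", "y"}


def is_abbreviation(token):
    # Assumption: an abbreviation is a single letter, optionally
    # followed by a period (e.g. "S." or "S").
    return len(token.rstrip(".")) == 1


def split_full_name(full_name):
    """Return (first_name, surnames) following the described heuristic."""
    tokens = [t for t in full_name.split(" ") if t]
    kept = [
        t for t in tokens
        if t.lower() not in CONNECTORS and not is_abbreviation(t)
    ]
    if not kept:
        return None, []
    # First token: first name. Second token: discarded, since it may be
    # a second given name. Third and later tokens: surnames.
    return kept[0], kept[2:]


def count_names(full_names):
    """Count first-name and surname frequencies over a list of full names."""
    first_counts, surname_counts = Counter(), Counter()
    for full_name in full_names:
        first, surnames = split_full_name(full_name.upper())
        if first is not None:
            first_counts[first] += 1
        surname_counts.update(surnames)
    return first_counts, surname_counts


# Made-up example names, for illustration only.
names = ["Maria da Conceição Ferreira dos Santos", "Ana S. Pereira Silva"]
first_counts, surname_counts = count_names(names)
```

Note that the heuristic always drops the second token, so a genuine surname in that position (such as "Pereira" in the second example) is not counted.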


4. Validation
==============
We manually validated the extracted names, in order to correct typographical and spelling errors, mostly concerning accents (e.g. Clìmaco > Clímaco; Conceicao > Conceição; Camoes > Camões; Assunçâo > Assunção).


5. Licensing terms
==================
NomesLex-PT01 is licensed under a Creative Commons Attribution 3.0 License (CC-BY).


6. Acknowledgments
==================
NomesLex-PT01 was supported in part by the REACTION project, and it was developed by the following researchers:

* David Batista (University of Lisbon, Faculty of Sciences)
* Mário J. Silva (University of Lisbon, Faculty of Sciences)
* Paula Carvalho (University of Lisbon, Faculty of Sciences)
